Abstract. We propose a new method of estimation in the high-dimensional linear regression model. It
allows for very weak distributional assumptions, including heteroscedasticity, and does not require
knowledge of the variance of the random errors. The method is based on linear programming only, so
its numerical implementation is faster than that of previously known techniques using conic programs,
and it allows one to deal with higher-dimensional models. We provide upper bounds for the estimation
and prediction errors of the proposed estimator, showing that it achieves the same rate as in the more
restrictive situation of fixed design and i.i.d. Gaussian errors with known variance. Following Gautier
and Tsybakov (2011), we obtain the results under weaker sensitivity assumptions than the restricted
eigenvalue or similar conditions.

1. Introduction 
In this paper, we consider the linear regression model 

\[ y_i = x_i^T \beta^* + u_i, \quad i = 1, \dots, n, \tag{1} \]

where $x_i$ are random vectors of explanatory variables in $\mathbb{R}^p$, and $u_i \in \mathbb{R}$ is a random error. The aim is
to estimate the vector $\beta^* \in \mathbb{R}^p$ from $n$ independent, not necessarily identically distributed realizations
$(y_i, x_i^T)$, $i = 1, \dots, n$. We are mainly interested in high-dimensional models where $p$ can be much
larger than $n$, under the sparsity scenario where only few components $\beta_k^*$ of $\beta^*$ are non-zero ($\beta^*$ is
sparse).

The most studied techniques for high-dimensional regression under the sparsity scenario are
the Lasso, the Dantzig selector, see, e.g., Candès and Tao (2007), Bickel, Ritov and Tsybakov (2009)
(more references can be found in Bühlmann and van de Geer (2011) and Koltchinskii (2011)), and
aggregation by exponential weighting (see Dalalyan and Tsybakov (2008), Rigollet and Tsybakov (2011,
2012) and the references cited therein). Most of the literature on high-dimensional regression assumes
that the random errors are Gaussian or subgaussian with known variance (or noise level). However,
quite recently several methods have been proposed which are independent of the noise level (see, e.g.,
Städler, Bühlmann and van de Geer (2010), Antoniadis (2010), Belloni, Chernozhukov and Wang
(2011a, 2011b), Gautier and Tsybakov (2011), Sun and Zhang (2011), Belloni, Chen, Chernozhukov,
and Hansen (2012) and Dalalyan (2012)). Among these, the methods of Belloni, Chernozhukov and
Wang (2011b), Belloni, Chen, Chernozhukov, and Hansen (2012), Gautier and Tsybakov (2011) can
handle non-identically distributed errors $u_i$ and are pivotal, i.e., rely on very weak distributional
assumptions. In Gautier and Tsybakov (2011), the regressors $x_i$ can be correlated with the errors $u_i$,
and an estimator is suggested that makes use of instrumental variables, called the STIV (Self-Tuned
Instrumental Variables) estimator. In a particular instance, the STIV estimator can be applied in the
classical linear regression model where all regressors are uncorrelated with the errors. This yields a
pivotal extension of the Dantzig selector based on conic programming. Gautier and Tsybakov (2011)
also present a method to obtain finite sample confidence sets that are robust to non-Gaussian and
heteroscedastic errors.

Another important issue is to relax the assumptions on the model under which the validity of 
the Lasso type methods is proved, such as the restricted eigenvalue condition of Bickel, Ritov and 
Tsybakov (2009) and its various analogs. Belloni, Chernozhukov and Wang (2011b) obtain fast rates 
for prediction for the Square-root Lasso under a relaxed version of the restricted eigenvalue condition. 
In the context of known noise variance, Ye and Zhang (2011) introduce cone invertibility factors instead 
of restricted eigenvalues. For pivotal estimation, an approach based on the sensitivities and sparsity 
certificates is introduced in Gautier and Tsybakov (2011), see more details below. Finally, note that 
aggregation by exponential weighting (Dalalyan and Tsybakov (2008), Rigollet and Tsybakov (2011, 
2012)) does not require any condition on the model but its numerical realization is based on MCMC 
algorithms in high dimension whose convergence rate is hard to assess theoretically. 

In this paper, we introduce a new pivotal estimator, called the Self-tuned Dantzig estimator. 
It is defined as a linear program, so from the numerical point of view it is simpler than the previously 
known pivotal estimators based on conic programming. We obtain upper bounds on its estimation 
and prediction errors under weak assumptions on the model and on the distribution of the errors 
showing that it achieves the same rate as in the more restrictive situation of fixed design and i.i.d. 
Gaussian errors with known variance. The model assumptions are based on the sensitivity analysis 
from Gautier and Tsybakov (2011). Distributional assumptions allow for dependence between $x_i$ and
$u_i$. When the $x_i$'s are independent from the $u_i$'s, it is enough to assume, for example, that the errors $u_i$ are
symmetric and have a finite second moment.



2. Notation 

We set $Y = (y_1, \dots, y_n)^T$, $U = (u_1, \dots, u_n)^T$, and we denote by $X$ the matrix of dimension
$n \times p$ with rows $x_i^T$, $i = 1, \dots, n$. We denote by $D$ the $p \times p$ diagonal normalizing matrix with diagonal
entries $d_{kk} > 0$, $k = 1, \dots, p$. Typical examples are: $d_{kk} = 1$ or

\[ d_{kk} = \Big( \frac{1}{n} \sum_{i=1}^n x_{ki}^2 \Big)^{-1/2} \quad \text{and} \quad d_{kk} = \Big( \max_{i=1,\dots,n} |x_{ki}| \Big)^{-1}, \]

where $x_{ki}$ is the $k$th component of $x_i$. For a vector $\beta \in \mathbb{R}^p$, let $J(\beta) = \{k \in \{1, \dots, p\} : \beta_k \neq 0\}$
be its support, i.e., the set of indices corresponding to its non-zero components $\beta_k$. We denote by
$|J|$ the cardinality of a set $J \subseteq \{1, \dots, p\}$ and by $J^c$ its complement: $J^c = \{1, \dots, p\} \setminus J$. The $\ell_p$
norm of a vector $\Delta$ is denoted by $|\Delta|_p$, $1 \le p \le \infty$. For $\Delta = (\Delta_1, \dots, \Delta_p)^T \in \mathbb{R}^p$ and a set of indices
$J \subseteq \{1, \dots, p\}$, we consider $\Delta_J = (\Delta_1 \mathbb{1}_{\{1 \in J\}}, \dots, \Delta_p \mathbb{1}_{\{p \in J\}})^T$, where $\mathbb{1}_{\{\cdot\}}$ is the indicator function.
For $a \in \mathbb{R}$, we set $a_+ = \max(0, a)$, $a_+^{-1} = (a_+)^{-1}$.
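As an illustration of the three normalizations above, the following minimal Python sketch computes the diagonal entries $d_{kk}$ for a small design matrix stored as a list of rows. The function name `normalizing_diagonal` and the `kind` labels are hypothetical, introduced only for this example, and the sketch assumes that no column of $X$ is identically zero.

```python
import math
from typing import List


def normalizing_diagonal(X: List[List[float]], kind: str = "unit") -> List[float]:
    """Return [d_11, ..., d_pp] for an n x p design matrix X given as rows x_i^T.

    kind="unit": d_kk = 1
    kind="rms":  d_kk = ((1/n) * sum_i x_ki^2) ** (-1/2)
    kind="max":  d_kk = (max_i |x_ki|) ** (-1)
    The labels are hypothetical names for the three examples in the text.
    Assumes no column of X is identically zero.
    """
    n, p = len(X), len(X[0])
    columns = [[X[i][k] for i in range(n)] for k in range(p)]
    if kind == "unit":
        return [1.0] * p
    if kind == "rms":
        return [1.0 / math.sqrt(sum(v * v for v in col) / n) for col in columns]
    if kind == "max":
        return [1.0 / max(abs(v) for v in col) for col in columns]
    raise ValueError(f"unknown normalization: {kind}")


# Toy design with n = 2 observations and p = 2 regressors.
X = [[3.0, 1.0], [4.0, -2.0]]
d_rms = normalizing_diagonal(X, "rms")
d_max = normalizing_diagonal(X, "max")
```

With the last two choices the rescaled regressors $d_{kk} x_{ki}$ have, respectively, unit empirical second moment and absolute values bounded by one.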

3. The Estimator 

We say that a pair $(\beta, \sigma) \in \mathbb{R}^p \times \mathbb{R}^+$ satisfies the Self-Tuned Dantzig constraint if it belongs to
the set

\[ \hat{\mathcal{D}} = \Big\{ (\beta, \sigma) : \ \beta \in \mathbb{R}^p, \ \sigma > 0, \ \Big| \frac{1}{n} D X^T (Y - X\beta) \Big|_\infty \le \sigma r \Big\} \tag{2} \]

for some $r > 0$ (specified below).

Definition 3.1. We call the Self-Tuned Dantzig estimator any solution $(\hat\beta, \hat\sigma)$ of the following
minimization problem

\[ \min_{(\beta, \sigma) \in \hat{\mathcal{D}}} \big( |D^{-1}\beta|_1 + c\sigma \big), \tag{3} \]

for some positive constant $c$.

Finding the Self-Tuned Dantzig estimator is a linear program. The term $c\sigma$ is included in the
criterion to prevent $\sigma$ from being chosen arbitrarily large. The choice of the constant $c$ will be discussed
later.
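To make the linear-programming claim concrete, one standard reformulation splits $\beta$ into its positive and negative parts, $\beta = \beta^+ - \beta^-$ with $\beta^+, \beta^- \ge 0$. This is a sketch and not necessarily the form used by the authors: the auxiliary variables $\beta^+, \beta^-$ are assumptions introduced for this illustration, and the strict constraint $\sigma > 0$ is relaxed to $\sigma \ge 0$.

```latex
\begin{aligned}
\min_{\beta^+,\,\beta^-,\,\sigma}\quad
  & \sum_{k=1}^p \frac{\beta_k^+ + \beta_k^-}{d_{kk}} + c\,\sigma \\
\text{subject to}\quad
  & -\sigma r \;\le\; \frac{d_{kk}}{n} \sum_{i=1}^n x_{ki}\Big(y_i - x_i^T(\beta^+ - \beta^-)\Big) \;\le\; \sigma r,
    \qquad k = 1,\dots,p, \\
  & \beta^+ \ge 0, \quad \beta^- \ge 0, \quad \sigma \ge 0 .
\end{aligned}
```

The objective and the $2p$ inequality constraints are linear in the $2p + 1$ variables. Since $d_{kk} > 0$, at an optimum $\beta_k^+$ and $\beta_k^-$ are never both positive, so the objective coincides with $|D^{-1}\beta|_1 + c\sigma$.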

4. Sensitivity Characteristics

The sensitivity characteristics are defined by the action of the matrix

\[ \Psi_n = \frac{1}{n} D X^T X D \]

on the so-called cone of dominant coordinates

\[ C_J^{(\gamma)} = \big\{ \Delta \in \mathbb{R}^p : \ |\Delta_{J^c}|_1 \le (1+\gamma) |\Delta_J|_1 \big\}, \]

for some $\gamma > 0$. It is straightforward that for $\Delta \in C_J^{(\gamma)}$,

\[ |\Delta|_1 \le (2+\gamma)|\Delta_J|_1 \le (2+\gamma)|J|^{1-1/q} |\Delta_J|_q, \quad \forall\, 1 \le q \le \infty. \tag{4} \]
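The cone condition and inequality (4) can be checked numerically on a small vector. The sketch below is only an illustration: the helper names `lq_norm` and `in_cone` and the test vector are hypothetical, and indices are 0-based as usual in Python.

```python
import math
from typing import Iterable, List, Set


def lq_norm(v: Iterable[float], q: float) -> float:
    """l_q norm for 1 <= q <= infinity (q = math.inf gives the max norm)."""
    vals = [abs(x) for x in v]
    if q == math.inf:
        return max(vals, default=0.0)
    return sum(x ** q for x in vals) ** (1.0 / q)


def in_cone(delta: List[float], J: Set[int], gamma: float) -> bool:
    """Check |Delta_{J^c}|_1 <= (1 + gamma) * |Delta_J|_1 (0-based indices in J)."""
    on_J = sum(abs(x) for k, x in enumerate(delta) if k in J)
    off_J = sum(abs(x) for k, x in enumerate(delta) if k not in J)
    return off_J <= (1.0 + gamma) * on_J


delta = [2.0, -1.0, 0.5, 1.5, -1.0]
J = {0, 1}
gamma = 0.5
delta_J = [x for k, x in enumerate(delta) if k in J]

member = in_cone(delta, J, gamma)            # compares 3.0 with 1.5 * 3.0
lhs = lq_norm(delta, 1)                      # |Delta|_1
mid = (2 + gamma) * lq_norm(delta_J, 1)      # (2 + gamma) |Delta_J|_1
rhs_q2 = (2 + gamma) * len(J) ** (1 - 1 / 2) * lq_norm(delta_J, 2)   # q = 2
rhs_qinf = (2 + gamma) * len(J) * lq_norm(delta_J, math.inf)         # q = infinity
```

For this vector the chain `lhs <= mid <= rhs_q2` holds, as (4) predicts for any element of the cone; the second inequality is Hölder's inequality applied to the $|J|$ coordinates of $\Delta_J$.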

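We now recall some definitions from Gautier and Tsybakov (2011). For $q \in [1, \infty]$, we define the
$\ell_q$ sensitivity as the following random variable

\[ \kappa_{q,J}^{(\gamma)} = \inf_{\Delta \in C_J^{(\gamma)} : \ |\Delta|_q = 1} |\Psi_n \Delta|_\infty. \]

Given a subset $J_0 \subseteq \{1, \dots, p\}$ and $q \in [1, \infty]$, we define the $\ell_q$-$J_0$-block sensitivity as

\[ \kappa_{q,J_0,J}^{(\gamma)} = \inf_{\Delta \in C_J^{(\gamma)} : \ |\Delta_{J_0}|_q = 1} |\Psi_n \Delta|_\infty. \tag{5} \]

By convention, we set $\kappa_{q,\emptyset,J}^{(\gamma)} = \infty$. Also, recall that the restricted eigenvalue of Bickel, Ritov and
Tsybakov (2009) is defined by

\[ \kappa_{RE,J}^{(\gamma)} = \inf_{\Delta \in \mathbb{R}^p \setminus \{0\} : \ \Delta \in C_J^{(\gamma)}} \frac{|\Delta^T \Psi_n \Delta|}{|\Delta_J|_2^2} \]

and a closely related quantity is

\[ \kappa_{RE,J}^{\prime(\gamma)} = \inf_{\Delta \in \mathbb{R}^p \setminus \{0\} : \ \Delta \in C_J^{(\gamma)}} \frac{|J| \, |\Delta^T \Psi_n \Delta|}{|\Delta_J|_1^2}. \]

The next result establishes a relation between restricted eigenvalues and sensitivities. It follows directly
from the Cauchy-Schwarz inequality and (4).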

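Lemma 4.1.

\[ \kappa_{RE,J}^{(\gamma)} \le \kappa_{RE,J}^{\prime(\gamma)} \le (2+\gamma)|J| \, \kappa_{1,J,J}^{(\gamma)} \le (2+\gamma)^2 |J| \, \kappa_{1,J}^{(\gamma)}. \tag{6} \]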

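The following proposition gives a useful lower bound on the sensitivity.

Proposition 4.2. If $|J| \le s$,

\[ \kappa_{1,J,J}^{(\gamma)} \ge \kappa_{1,0}^{(\gamma)}(s) = \frac{1}{s} \min_{k=1,\dots,p} \Big\{ \min_{\Delta : \ \Delta_k = 1, \ |\Delta|_1 \le (2+\gamma)s} |\Psi_n \Delta|_\infty \Big\}. \tag{7} \]

Proof. We have

\begin{align*}
\kappa_{1,J,J}^{(\gamma)}
&= \inf_{\Delta : \ |\Delta_J|_1 = 1, \ |\Delta_{J^c}|_1 \le 1+\gamma} |\Psi_n \Delta|_\infty \\
&\ge \inf_{\Delta : \ |\Delta|_\infty \ge 1/s, \ |\Delta|_1 \le 2+\gamma} |\Psi_n \Delta|_\infty \\
&= \frac{1}{s} \inf_{\Delta : \ |\Delta|_\infty \ge 1, \ |\Delta|_1 \le (2+\gamma)s} |\Psi_n \Delta|_\infty \quad \text{(by homogeneity)} \\
&\ge \frac{1}{s} \inf_{\Delta : \ |\Delta|_\infty \ge 1, \ |\Delta|_1 \le (2+\gamma)s} \frac{|\Psi_n \Delta|_\infty}{|\Delta|_\infty} \\
&\ge \frac{1}{s} \inf_{\Delta : \ |\Delta|_\infty \ge 1, \ |\Delta|_1 \le (2+\gamma)s|\Delta|_\infty} \frac{|\Psi_n \Delta|_\infty}{|\Delta|_\infty} \\
&= \frac{1}{s} \inf_{\Delta : \ |\Delta|_\infty = 1, \ |\Delta|_1 \le (2+\gamma)s} |\Psi_n \Delta|_\infty \quad \text{(by homogeneity)} \\
&= \frac{1}{s} \min_{k=1,\dots,p} \Big\{ \inf_{\Delta : \ \Delta_k = 1, \ |\Delta|_1 \le (2+\gamma)s} |\Psi_n \Delta|_\infty \Big\}. \qquad \square
\end{align*}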

Note that the random variable $\kappa_{1,0}^{(\gamma)}(s)$ depends only on the observed data. It is not difficult to see
that it can be obtained by solving $p$ linear programs. For more details and further results on the
sensitivity characteristics, see Gautier and Tsybakov (2011).
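One way to see the reduction to $p$ linear programs is the usual epigraph reformulation. The following is a sketch in which the auxiliary variables $v$, $\Delta^+$, $\Delta^-$ are introduced here for illustration, with $\Delta = \Delta^+ - \Delta^-$. For each fixed $k$, the inner minimization in Proposition 4.2 can be written as:

```latex
\begin{aligned}
\min_{\Delta^+,\,\Delta^-,\,v}\quad & v \\
\text{subject to}\quad
  & -v \;\le\; \big(\Psi_n(\Delta^+ - \Delta^-)\big)_j \;\le\; v, \qquad j = 1,\dots,p, \\
  & \Delta_k^+ - \Delta_k^- = 1, \\
  & \sum_{j=1}^p \big(\Delta_j^+ + \Delta_j^-\big) \;\le\; (2+\gamma)s, \\
  & \Delta^+ \ge 0, \quad \Delta^- \ge 0 .
\end{aligned}
```

Any $\Delta$ with $|\Delta|_1 \le (2+\gamma)s$ is represented by its positive and negative parts, and conversely any feasible pair gives $|\Delta^+ - \Delta^-|_1 \le (2+\gamma)s$, so the optimal value equals the inner minimum. Taking the smallest of the $p$ optimal values and dividing by $s$ then gives the lower bound of Proposition 4.2.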

5. Bounds on the estimation and prediction errors 

In this section, we use the notation $\Delta = D^{-1}(\hat\beta - \beta^*)$. Let $0 < \alpha < 1$ be a given constant. We
choose the tuning parameter $r$ in the definition of $\hat{\mathcal{D}}$ as follows:

\[ r = \sqrt{\frac{2\log(4p/\alpha)}{n}}. \tag{8} \]
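As a quick numerical illustration of (8), the sketch below evaluates $r$ for one choice of $n$, $p$ and $\alpha$ (the values and the function name `tuning_parameter` are arbitrary assumptions for this example). It also checks the arithmetic behind this choice: with $t = \sqrt{n}\, r$, a tail bound of the form $2\exp(-t^2/2)$ summed over $p$ coordinates equals exactly $\alpha/2$.

```python
import math


def tuning_parameter(n: int, p: int, alpha: float) -> float:
    """r = sqrt(2 * log(4p / alpha) / n), as in (8)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(4.0 * p / alpha) / n)


n, p, alpha = 100, 1000, 0.05
r = tuning_parameter(n, p, alpha)

# With t = sqrt(n) * r, the bound 2 * exp(-t^2 / 2) summed over the
# p coordinates equals alpha / 2.
union_bound = p * 2.0 * math.exp(-n * r * r / 2.0)
```

Note that $r$ depends on $p$ only through $\log p$, which is what allows $p$ to be much larger than $n$.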



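Theorem 5.1. For all $i = 1, \dots, n$ and $k = 1, \dots, p$, let the random variables $x_{ki}u_i$ be symmetric.
Let $Q^* > 0$ be a constant such that

\[ P\Big( \max_{k=1,\dots,p} \frac{d_{kk}^2}{n} \sum_{i=1}^n x_{ki}^2 u_i^2 > Q^* \Big) \le \alpha/2. \tag{9} \]

Assume that $|J(\beta^*)| \le s$, and set in (3)

\[ c = \frac{(2\gamma+1) r}{\kappa_{1,0}^{(\gamma)}(s)}, \tag{10} \]

where $\gamma$ is a positive number. Then, with probability at least $1 - \alpha$, for any $\gamma > 0$ and any $\hat\beta$ such
that $(\hat\beta, \hat\sigma)$ is a solution of the minimization problem (3) with $c$ defined in (10), we have the following
bounds on the $\ell_1$ estimation error and on the prediction error:

\[ |D^{-1}(\hat\beta - \beta^*)|_1 \le \frac{(\gamma+2)(2\gamma+1)\sqrt{Q^*}}{\gamma \, \kappa_{1,0}^{(\gamma)}(s)} \, r, \tag{11} \]

\[ \Delta^T \Psi_n \Delta \le \frac{(\gamma+2)(2\gamma+1)^2 Q^*}{\gamma^2 \, \kappa_{1,0}^{(\gamma)}(s)} \, r^2. \tag{12} \]

Proof. Set

\[ \hat Q(\beta) = \max_{k=1,\dots,p} \frac{d_{kk}^2}{n} \sum_{i=1}^n x_{ki}^2 (y_i - x_i^T\beta)^2, \]

and define the event

\[ \mathcal{G} = \Big\{ \Big| \frac{d_{kk}}{n} \sum_{i=1}^n x_{ki}u_i \Big| \le r\sqrt{\hat Q(\beta^*)}, \ k = 1, \dots, p \Big\}. \]

Then

\[ \mathcal{G}^c \subseteq \bigcup_{k=1,\dots,p} \Big\{ \frac{\big|\sum_{i=1}^n x_{ki}u_i\big|}{\sqrt{\sum_{i=1}^n (x_{ki}u_i)^2}} > \sqrt{n}\, r \Big\} \]

and the union bound yields

\[ P(\mathcal{G}^c) \le \sum_{k=1}^p P\Big( \frac{\big|\sum_{i=1}^n x_{ki}u_i\big|}{\sqrt{\sum_{i=1}^n (x_{ki}u_i)^2}} > \sqrt{n}\, r \Big). \tag{13} \]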



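We now use the following result on deviations of self-normalized sums due to Efron (1969).

Lemma 5.2. If $\eta_1, \dots, \eta_n$ are independent symmetric random variables, then

\[ P\Big( \frac{\big|\sum_{i=1}^n \eta_i\big|}{\sqrt{\sum_{i=1}^n \eta_i^2}} \ge t \Big) \le 2\exp\Big( -\frac{t^2}{2} \Big), \quad \forall\, t > 0. \]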
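The bound of Lemma 5.2 requires no moment assumptions, which can be illustrated by a small Monte Carlo experiment. The sketch below is a sanity check and not part of the proof. It uses symmetric heavy-tailed variables (a random sign times a Pareto magnitude, an illustrative choice not taken from the text), and the function name `self_normalized_tail` is hypothetical.

```python
import math
import random


def self_normalized_tail(n: int, t: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(|sum eta_i| / sqrt(sum eta_i^2) >= t).

    The eta_i are independent, symmetric and heavy-tailed: a random sign
    times a Pareto(1.5) magnitude, which has infinite variance.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        etas = [rng.choice((-1.0, 1.0)) * rng.paretovariate(1.5) for _ in range(n)]
        # Magnitudes are at least 1, so the denominator is never zero.
        ratio = abs(sum(etas)) / math.sqrt(sum(e * e for e in etas))
        if ratio >= t:
            hits += 1
    return hits / trials


t = 2.0
estimate = self_normalized_tail(n=50, t=t, trials=2000)
bound = 2.0 * math.exp(-t * t / 2.0)
```

The empirical frequency `estimate` should not exceed `bound`, even though the summands have infinite variance; this robustness is what makes the resulting estimator pivotal.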

For each of the probabilities on the right-hand side of (13), we apply Lemma 5.2 with $\eta_i = x_{ki}u_i$.
This and the definition of $r$ yield $P(\mathcal{G}^c) \le \alpha/2$. Thus, the event $\mathcal{G}$ holds with probability at least $1 - \alpha/2$.
On the event $\mathcal{G}$ we have



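\begin{align}
|\Psi_n \Delta|_\infty &\le \Big| \frac{1}{n} D X^T (Y - X\hat\beta) \Big|_\infty + \Big| \frac{1}{n} D X^T (Y - X\beta^*) \Big|_\infty \tag{14} \\
&\le r\hat\sigma + \Big| \frac{1}{n} D X^T U \Big|_\infty \tag{15} \\
&\le r \Big( 2\sqrt{\hat Q(\beta^*)} + \hat\sigma - \sqrt{\hat Q(\beta^*)} \Big). \tag{16}
\end{align}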



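Inequality (15) holds because $(\hat\beta, \hat\sigma)$ belongs to the set $\hat{\mathcal{D}}$ by definition. Notice that, on the event $\mathcal{G}$,
$\big(\beta^*, \sqrt{\hat Q(\beta^*)}\big)$ belongs to the set $\hat{\mathcal{D}}$. On the other hand, $(\hat\beta, \hat\sigma)$ minimizes the criterion $|D^{-1}\beta|_1 + c\sigma$
on the same set $\hat{\mathcal{D}}$. Thus, on the event $\mathcal{G}$,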



\[ |D^{-1}\hat\beta|_1 + c\hat\sigma \le |D^{-1}\beta^*|_1 + c\sqrt{\hat Q(\beta^*)}. \tag{17} \]



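This implies, again on the event $\mathcal{G}$,

\begin{align}
|\Psi_n \Delta|_\infty &\le r \Big( 2\sqrt{\hat Q(\beta^*)} + \frac{1}{c} \sum_{k \in J(\beta^*)} \Big( \Big| \frac{\beta_k^*}{d_{kk}} \Big| - \Big| \frac{\hat\beta_k}{d_{kk}} \Big| \Big) - \frac{1}{c} \sum_{k \in J(\beta^*)^c} \Big| \frac{\hat\beta_k}{d_{kk}} \Big| \Big) \notag \\
&\le r \Big( 2\sqrt{\hat Q(\beta^*)} + \frac{1}{c} |\Delta_{J(\beta^*)}|_1 \Big), \tag{18}
\end{align}

where $\beta_k^*, \hat\beta_k$ are the $k$th components of $\beta^*, \hat\beta$. Similarly, (17) implies that, on the event $\mathcal{G}$,

\begin{align}
|\Delta_{J(\beta^*)^c}|_1 = \sum_{k \in J(\beta^*)^c} \Big| \frac{\hat\beta_k}{d_{kk}} \Big|
&\le \sum_{k \in J(\beta^*)} \Big( \Big| \frac{\beta_k^*}{d_{kk}} \Big| - \Big| \frac{\hat\beta_k}{d_{kk}} \Big| \Big) + c\sqrt{\hat Q(\beta^*)} \notag \\
&\le |\Delta_{J(\beta^*)}|_1 + c\sqrt{\hat Q(\beta^*)}. \tag{19}
\end{align}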

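We now distinguish between the following two cases.

Case 1: $c\sqrt{\hat Q(\beta^*)} \le \gamma |\Delta_{J(\beta^*)}|_1$. In this case (19) implies

\[ |\Delta_{J(\beta^*)^c}|_1 \le (1+\gamma) |\Delta_{J(\beta^*)}|_1. \tag{20} \]

Thus, $\Delta \in C_{J(\beta^*)}^{(\gamma)}$ on the event $\mathcal{G}$. By definition of $\kappa_{1,J(\beta^*),J(\beta^*)}^{(\gamma)}$ and (7),

\[ |\Delta_{J(\beta^*)}|_1 \le \frac{|\Psi_n \Delta|_\infty}{\kappa_{1,J(\beta^*),J(\beta^*)}^{(\gamma)}} \le \frac{|\Psi_n \Delta|_\infty}{\kappa_{1,0}^{(\gamma)}(s)}. \]

This and (18) yield

\[ |\Delta_{J(\beta^*)}|_1 \le \frac{2r\sqrt{\hat Q(\beta^*)}}{\kappa_{1,0}^{(\gamma)}(s)} \Big( 1 - \frac{r}{c \, \kappa_{1,0}^{(\gamma)}(s)} \Big)_+^{-1}. \]

Case 2: $c\sqrt{\hat Q(\beta^*)} > \gamma |\Delta_{J(\beta^*)}|_1$. Then, obviously, $|\Delta_{J(\beta^*)}|_1 < \frac{c}{\gamma}\sqrt{\hat Q(\beta^*)}$.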
Combining the two cases we obtain, on the event $\mathcal{G}$,

\[ |\Delta_{J(\beta^*)}|_1 \le \sqrt{\hat Q(\beta^*)} \, \max\Big\{ \frac{2r}{\kappa_{1,0}^{(\gamma)}(s)} \Big( 1 - \frac{r}{c \, \kappa_{1,0}^{(\gamma)}(s)} \Big)_+^{-1}, \ \frac{c}{\gamma} \Big\}. \tag{21} \]

In this argument, $c > 0$ and $\gamma > 0$ were arbitrary. The value of $c$ given in (10) is the minimizer of the
right-hand side of (21). Plugging it in (21) we find that, with probability at least $1 - \alpha/2$,

\[ |\Delta|_1 \le \frac{(\gamma+2)(2\gamma+1) r}{\gamma \, \kappa_{1,0}^{(\gamma)}(s)} \sqrt{\hat Q(\beta^*)}, \]

where we have used (19). Now, by (9), $\hat Q(\beta^*) \le Q^*$ with probability at least $1 - \alpha/2$. Thus, we get
that (11) holds with probability at least $1 - \alpha$. Next, using (18) we obtain that, on the same event of
probability at least $1 - \alpha$,

\[ |\Psi_n \Delta|_\infty \le \frac{(2\gamma+1) r}{\gamma} \sqrt{Q^*}. \]

Combining this inequality with (11) yields (12). $\square$
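For completeness, the minimization over $c$ used at the end of the proof can be spelled out as follows. Write $\kappa$ for the sensitivity appearing in (21); this shorthand is introduced here only for readability. For $c \le r/\kappa$ the first term in the maximum is infinite by the convention $a_+^{-1} = (a_+)^{-1}$. For $c > r/\kappa$ the first term is decreasing in $c$ and the second is increasing, so the maximum is smallest where the two coincide:

```latex
\begin{aligned}
\frac{2r}{\kappa}\Big(1 - \frac{r}{c\kappa}\Big)^{-1} = \frac{2rc}{c\kappa - r} = \frac{c}{\gamma}
\;&\Longleftrightarrow\; 2\gamma r = c\kappa - r
\;\Longleftrightarrow\; c = \frac{(2\gamma+1)\,r}{\kappa}, \\[4pt]
|\Delta_{J(\beta^*)}|_1 \;&\le\; \frac{c}{\gamma}\sqrt{\hat Q(\beta^*)}
 \;=\; \frac{(2\gamma+1)\,r}{\gamma\kappa}\sqrt{\hat Q(\beta^*)}, \\[4pt]
|\Delta|_1 \;&\le\; 2\,|\Delta_{J(\beta^*)}|_1 + c\sqrt{\hat Q(\beta^*)}
 \;\le\; \Big(\frac{2}{\gamma} + 1\Big)\frac{(2\gamma+1)\,r}{\kappa}\sqrt{\hat Q(\beta^*)}
 \;=\; \frac{(\gamma+2)(2\gamma+1)\,r}{\gamma\kappa}\sqrt{\hat Q(\beta^*)} .
\end{aligned}
```

The first inequality in the last line is (19) added to $|\Delta_{J(\beta^*)}|_1$. The same value of $c$ gives $\frac{1}{c}|\Delta_{J(\beta^*)}|_1 \le \frac{1}{\gamma}\sqrt{\hat Q(\beta^*)}$, which turns (18) into $|\Psi_n\Delta|_\infty \le (2 + 1/\gamma)\, r \sqrt{\hat Q(\beta^*)}$.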

Discussion of Theorem 5.1

(1) In view of Lemma 4.1, $\kappa_{1,J(\beta^*),J(\beta^*)}^{(\gamma)} \ge ((2+\gamma)s)^{-1} \kappa_{RE,J(\beta^*)}^{(\gamma)}$. Also, it is easy to see from
Proposition 4.2 that $\kappa_{1,0}^{(\gamma)}(s)$ is of the order $1/s$ when $\Psi_n$ is the identity matrix and $p \gg s$ (this
is preserved for $\Psi_n$ that are small perturbations of the identity). Thus, the bounds (11) and
(12) take the form

\[ |D^{-1}(\hat\beta - \beta^*)|_1 \le C s \sqrt{\frac{\log p}{n}}, \qquad \Delta^T \Psi_n \Delta \le C \frac{s \log p}{n}, \]

for some constant $C$, and we recover the usual rates for the $\ell_1$ estimation and for the prediction
error respectively, cf. Bickel, Ritov and Tsybakov (2009).
(2) Theorem 5.1 does not assume that the $x_{ki}$'s are independent from the $u_i$'s. The only assumption is
the symmetry of $x_{ki}u_i$. However, if $x_{ki}$ is independent from $u_i$, then by conditioning on $x_{ki}$ in
the bound for $P(\mathcal{G}^c)$, it is enough to assume the symmetry of the $u_i$'s. Furthermore, while we have
chosen the symmetry since it makes the conditions of Theorem 5.1 simple and transparent, it
is not essential for our argument to be applied. The only point in the proof where we use the
symmetry is the bound for the probability of deviations of self-normalized sums $P(\mathcal{G}^c)$. This
probability can be bounded in many other ways without the symmetry assumption, cf., e.g.,
Gautier and Tsybakov (2011). It is enough to have $E[x_{ki}u_i] = 0$ and a uniform over $k$ control
of the ratio

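\[ \frac{\big( \sum_{i=1}^n E[x_{ki}^2 u_i^2] \big)^{1/2}}{\big( \sum_{i=1}^n E[|x_{ki}u_i|^{2+\delta}] \big)^{1/(2+\delta)}} \]

for some $\delta > 0$, cf. [H] or [BJ.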

(3) The quantity $Q^*$ is not present in the definition of the estimator and is needed only to assess
the rate of convergence. It is not hard to find $Q^*$ in various situations. The simplest case
is when $d_{kk} = 1$ and the random variables $x_{ki}$ and $u_i$ are bounded uniformly in $k, i$ by a
constant $L$. Then we can take $Q^* = L^4$. If only the $x_{ki}$ are bounded uniformly in $k$ by $L$,
condition (9) holds when $P\big( \frac{1}{n} \sum_{i=1}^n u_i^2 > Q^*/L^2 \big) \le \alpha/2$, and then for $Q^*$ to be bounded
it is enough to assume that the $u_i$'s have a finite second moment. The same remark applies
when $d_{kk} = (\max_{i=1,\dots,n} |x_{ki}|)^{-1}$, with an advantage that in this case we guarantee that $Q^*$ is
bounded under no assumption on $x_{ki}$.

(4) The bounds in Theorem 5.1 depend on $\gamma > 0$, which can be optimized. Indeed, the functions
of $\gamma$ on the right-hand sides of (11) and (12) are data-driven and can be minimized on a grid
of values of $\gamma$. Thus, we obtain an optimal (random) value $\gamma = \hat\gamma$, for which (11) and (12)
remain valid, since these results hold for any $\gamma > 0$.
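The grid search over $\gamma$ mentioned in item (4) can be sketched as follows. This is a minimal illustration under stated assumptions. It takes the $\ell_1$ bound to be proportional to $(\gamma+2)(2\gamma+1)/(\gamma\,\kappa(\gamma))$, where $\kappa(\gamma)$ stands for the data-driven sensitivity in the denominator, as in the derivation in the proof. That sensitivity is supplied by the caller through a hypothetical function `sensitivity`, and a constant placeholder is used in the example rather than a value computed from data.

```python
from typing import Callable, Iterable, Tuple


def gamma_factor(gamma: float, kappa: float) -> float:
    """Factor (gamma + 2)(2*gamma + 1) / (gamma * kappa) multiplying sqrt(Q*) * r."""
    return (gamma + 2.0) * (2.0 * gamma + 1.0) / (gamma * kappa)


def best_gamma(
    grid: Iterable[float], sensitivity: Callable[[float], float]
) -> Tuple[float, float]:
    """Return the pair (gamma, factor) minimizing gamma_factor over the grid."""
    candidates = ((g, gamma_factor(g, sensitivity(g))) for g in grid)
    return min(candidates, key=lambda pair: pair[1])


# Placeholder sensitivity, constant in gamma (an assumption for illustration).
# In practice it would be recomputed from the data for each gamma on the grid.
grid = [0.25, 0.5, 1.0, 2.0, 4.0]
gamma_hat, factor = best_gamma(grid, lambda g: 1.0)
```

With a constant sensitivity the factor equals $2\gamma + 5 + 2/\gamma$ up to scaling, which is minimized at $\gamma = 1$. With a data-driven sensitivity that varies with $\gamma$, the minimizer will in general differ and is random.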